42 research outputs found

    How Many Corn Traits Do You Need?

    We seem to live in a “have it your way” world, with everything from fast-food hamburgers to “designer” clothing tailored in countries some of us didn’t know existed. In contrast, modern mass-production systems encourage wholesale consumption of identical products worldwide: “one size fits all.” These contradictions also exist in the world of corn production.

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale. Comment: 7 page
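The branch, difference, and merge operations named in this abstract can be illustrated with a toy in-memory sketch. Everything here is an assumption made for illustration: the class `VersionedDataset`, its method names, the key-to-row dictionary layout, and the "source wins" merge rule are hypothetical and are not DataHub's API or design.

```python
class VersionedDataset:
    """Toy dataset version store: each branch is a full copy of key -> row.

    Hypothetical illustration only; a real system would share storage
    between versions instead of copying every row.
    """

    def __init__(self, rows):
        # Start with a single branch holding the initial rows.
        self.branches = {"master": dict(rows)}

    def branch(self, source, new_name):
        # Create a new branch as a snapshot of an existing one.
        self.branches[new_name] = dict(self.branches[source])

    def put(self, branch, key, row):
        # Insert or overwrite one row on the given branch.
        self.branches[branch][key] = row

    def diff(self, a, b):
        # Map each differing key to its (value in a, value in b) pair;
        # a key missing from one branch shows up as None on that side.
        rows_a, rows_b = self.branches[a], self.branches[b]
        return {
            key: (rows_a.get(key), rows_b.get(key))
            for key in rows_a.keys() | rows_b.keys()
            if rows_a.get(key) != rows_b.get(key)
        }

    def merge(self, source, target):
        # Naive merge: rows from source overwrite rows in target.
        # Deletions are not propagated and conflicts are not reported.
        self.branches[target].update(self.branches[source])


ds = VersionedDataset({1: "alice", 2: "bob"})
ds.branch("master", "cleaning")
ds.put("cleaning", 2, "Bob")
changes = ds.diff("master", "cleaning")
ds.merge("cleaning", "master")
```

Copying whole branches keeps the sketch short but wastes storage, which is one of the scale challenges the abstract alludes to.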

    Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

    Today, data analysts largely rely on intuition to determine whether missing or withheld rows of a dataset significantly affect their analyses. We propose a framework that can produce automatic contingency analysis, i.e., the range of values an aggregate SQL query could take, under formal constraints describing the variation and frequency of missing data tuples. We describe how to process SUM, COUNT, AVG, MIN, and MAX queries under these conditions, resulting in hard error bounds with testable constraints. We propose an optimization algorithm based on an integer program that reconciles a set of such constraints, even if they are overlapping, conflicting, or unsatisfiable, into such bounds. Our experiments on real-world datasets against several statistical imputation and inference baselines show that statistical techniques can have a deceptively high error rate that is often unpredictable. In contrast, our framework offers hard bounds that are guaranteed to hold if the constraints are not violated. In spite of these hard bounds, we show competitive accuracy to statistical baselines.
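The idea of a hard range for an aggregate can be sketched for the simplest case: a single constraint that bounds how many rows are missing and what value each missing row may take. This is only an illustration under those assumptions. The names `MissingRowConstraint`, `sum_bounds`, and `count_bounds` are hypothetical, and the sketch does not implement the integer program the abstract describes for reconciling multiple overlapping constraints.

```python
from dataclasses import dataclass


@dataclass
class MissingRowConstraint:
    """Assumed constraint: between min_rows and max_rows rows are missing,
    and each missing value lies in [min_value, max_value]."""
    min_rows: int
    max_rows: int
    min_value: float
    max_value: float


def sum_bounds(observed, c):
    """Range that SUM could take once the missing rows are accounted for."""
    base = sum(observed)
    # The missing contribution is k * v with k in [min_rows, max_rows] and
    # v in [min_value, max_value]. It is linear in k for fixed v, so the
    # extremes occur at the endpoints of the row-count range.
    low = min(c.min_rows * c.min_value, c.max_rows * c.min_value)
    high = max(c.min_rows * c.max_value, c.max_rows * c.max_value)
    return base + low, base + high


def count_bounds(observed, c):
    """Range that COUNT could take once the missing rows are accounted for."""
    return len(observed) + c.min_rows, len(observed) + c.max_rows


observed = [10, 20, 30]
constraint = MissingRowConstraint(min_rows=0, max_rows=2,
                                  min_value=5.0, max_value=50.0)
total_range = sum_bounds(observed, constraint)
count_range = count_bounds(observed, constraint)
```

With these inputs the observed sum is 60, and up to two missing rows of at most 50 each can raise it to 160, so the SUM range is 60 to 160 and the COUNT range is 3 to 5. The bounds hold only if the stated constraint itself is true of the missing data.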

    Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores

    We present a framework for concurrency control and availability in multi-datacenter datastores. While we consider Google's Megastore as our motivating example, we define general abstractions for key components, making our solution extensible to any system that satisfies the abstraction properties. We first develop and analyze a transaction management and replication protocol based on a straightforward implementation of the Paxos algorithm. Our investigation reveals that this protocol acts as a concurrency prevention mechanism rather than a concurrency control mechanism. We then propose an enhanced protocol called Paxos with Combination and Promotion (Paxos-CP) that provides true transaction concurrency while requiring the same per-instance message complexity as the basic Paxos protocol. Finally, we compare the performance of Paxos and Paxos-CP in a multi-datacenter experimental study, and we demonstrate that Paxos-CP results in significantly fewer aborted transactions than basic Paxos. Comment: VLDB201

    Decibel: the relational dataset branching system

    As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This not only wastes storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs. National Science Foundation (U.S.) (1513972); National Science Foundation (U.S.) (1513407); National Science Foundation (U.S.) (1513443); Intel Science and Technology Center for Big Dat

    Herbicide-Resistance in Turf Systems: Insights and Options for Managing Complexity

    Due to complex interactions between social and ecological systems, herbicide resistance has classic features of a “wicked problem.” Herbicide-resistant (HR) Poa annua poses a risk to sustainably managing U.S. turfgrass systems, but there is scant knowledge to guide its management. Six focus groups were conducted throughout the United States to gain understanding of socio-economic barriers to adopting herbicide-resistance management practices. Professionals from major turfgrass sectors (golf courses, sports fields, lawn care, and seed/sod production) were recruited as focus-group participants. Discussions emphasized the challenges of weed management in turfgrass systems as compared to agronomic crops. These included greater time constraints for managing weeds and more limited chemical control options. Lack of understanding about the proper use of compounds with different modes of action was identified as a threat to sustainable weed management. There were significant regional differences in perceptions of the existence, geographic scope, and social and ecological causes of HR in managing Poa annua. Effective resistance management will require tailoring chemical and non-chemical practices to the specific conditions of different turfgrass sectors and regions. Some participants thought it would be helpful to have multi-year resistance management programs that are both sector- and species-specific.